Efficient Logging in Non-Volatile Memory by Exploiting Coherency Protocols
Non-volatile memory (NVM) technologies such as PCM, ReRAM and STT-RAM allow
processors to directly write values to persistent storage at speeds that are
significantly faster than previous durable media such as hard drives or SSDs.
Many applications of NVM are constructed on a logging subsystem, which enables
operations to appear to execute atomically and facilitates recovery from
failures. Writes to NVM, however, pass through a processor's memory system,
which can delay and reorder them and can impair the correctness and cost of
logging algorithms.
Reordering arises because of out-of-order execution in a CPU and the
inter-processor cache coherence protocol. By carefully considering the
properties of these reorderings, this paper develops a logging protocol that
requires only one round trip to non-volatile memory while avoiding expensive
computations. We show how to extend the logging protocol to build a
persistent set (hash map) that also requires only a single round trip to
non-volatile memory for insertion, updating, or deletion.
Fine-Grain Checkpointing with In-Cache-Line Logging
Non-Volatile Memory offers the possibility of implementing high-performance,
durable data structures. However, achieving performance comparable to
well-designed data structures in non-persistent (transient) memory is
difficult, primarily because of the cost of ensuring the order in which memory
writes reach NVM. Often, this requires flushing data to NVM and waiting a full
memory round-trip time.
In this paper, we introduce two new techniques: Fine-Grained Checkpointing,
which ensures a consistent, quickly recoverable data structure in NVM after a
system failure, and In-Cache-Line Logging, an undo-logging technique that
enables recovery of earlier state without requiring cache-line flushes in the
normal case. We implemented these techniques in the Masstree data structure,
making it persistent and demonstrating the ease of applying them to a highly
optimized system and their low (5.9-15.4%) runtime overhead cost.
Comment: In 2019 Architectural Support for Programming Languages and Operating
Systems (ASPLOS 19), April 13, 2019, Providence, RI, US
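To illustrate the general idea of undo logging mentioned in the abstract above, here is a minimal sketch in Python. It is a generic undo log only: it does not model cache lines, flush ordering, or the paper's In-Cache-Line Logging layout, and the class and method names (`UndoLoggedStore`, `write`, `commit`, `recover`) are hypothetical.

```python
class UndoLoggedStore:
    """Toy key-value store that can roll back uncommitted writes."""

    def __init__(self):
        self.data = {}
        # Each log entry records (key, key_existed_before, old_value).
        self.log = []

    def write(self, key, value):
        # Record the old state BEFORE overwriting, so it can be restored.
        self.log.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        # Once the writes are considered durable, the undo log is discarded.
        self.log.clear()

    def recover(self):
        # Undo uncommitted writes in reverse order to restore earlier state.
        while self.log:
            key, existed, old_value = self.log.pop()
            if existed:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)


store = UndoLoggedStore()
store.write("a", 1)
store.commit()
store.write("a", 2)   # uncommitted
store.write("b", 3)   # uncommitted
store.recover()       # simulated failure before commit
```

After `recover()`, the store holds only the committed write, since both uncommitted writes are rolled back.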
Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism
The demise of Moore's Law and Dennard Scaling has revived interest in
specialized computer architectures and accelerators. Verification and testing
of this hardware heavily uses cycle-accurate simulation of
register-transfer-level (RTL) designs. The best software RTL simulators can
simulate designs at 1-1000 kHz, i.e., more than three orders of magnitude
slower than hardware. Faster simulation can increase productivity by speeding
design iterations and permitting more exhaustive exploration.
One possibility is to use parallelism, as RTL exposes considerable fine-grain
concurrency. However, state-of-the-art RTL simulators generally perform best
when single-threaded since modern processors cannot effectively exploit
fine-grain parallelism.
This work presents Manticore: a parallel computer designed to accelerate RTL
simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution
model to eliminate runtime synchronization barriers among many simple
processors. Manticore relies entirely on its compiler to schedule resources and
communication. Because RTL code is practically free of long divergent execution
paths, static scheduling is feasible. Communication and synchronization no
longer incur runtime overhead, enabling efficient fine-grain parallelism.
Moreover, static scheduling dramatically simplifies the physical
implementation, significantly increasing the potential parallelism on a chip.
Our 225-core FPGA prototype running at 475 MHz outperforms a state-of-the-art
RTL simulator on an Intel Xeon processor running at 3.3 GHz by up to
27.9x (geomean 5.3x) in nine Verilog benchmarks.
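The static bulk-synchronous model described in the abstract above can be sketched roughly as follows. This is an illustrative toy in Python, not Manticore's actual design: the two "cores", the 2-bit counter, and all names (`run_bsp`, `counter_value`, `bit0_in`) are assumptions made for the example. The point it shows is that the communication list is fixed before execution, so no runtime synchronization decisions are needed.

```python
def run_bsp(cores, schedule, states, cycles):
    """Run `cycles` supersteps with a fixed, precomputed communication schedule."""
    for _ in range(cycles):
        # Compute phase: each core evaluates its logic on its own local state.
        outputs = [core(state) for core, state in zip(cores, states)]
        # Commit phase: each core latches its own next-state values.
        for state, out in zip(states, outputs):
            state.update(out)
        # Communication phase: a static list of (src, src_key, dst, dst_key)
        # copies, decided ahead of time rather than at run time.
        for src, src_key, dst, dst_key in schedule:
            states[dst][dst_key] = states[src][src_key]
    return states


def core0(state):
    # Low bit of the counter toggles every cycle.
    return {"bit0": state["bit0"] ^ 1}


def core1(state):
    # High bit toggles when the low bit (received last superstep) was 1.
    return {"bit1": state["bit1"] ^ state["bit0_in"]}


def counter_value(cycles):
    states = [{"bit0": 0}, {"bit1": 0, "bit0_in": 0}]
    schedule = [(0, "bit0", 1, "bit0_in")]  # core 0 sends bit0 to core 1
    run_bsp([core0, core1], schedule, states, cycles)
    return states[0]["bit0"] + 2 * states[1]["bit1"]
```

Calling `counter_value(n)` simulates `n` clock cycles of a 2-bit counter split across the two cores and returns its value.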
Sirocco: cost-effective fine-grain distributed shared memory
Software fine-grain distributed shared memory (FGDSM) provides a simplified shared-memory programming interface with minimal or no hardware support. Originally, software FGDSMs targeted uniprocessor-node parallel machines. This paper presents Sirocco, a family of software FGDSMs implemented on a network of low-cost SMPs. Sirocco takes full advantage of SMP nodes by implementing inter-node sharing directly in hardware and overlapping computation with protocol execution. To maintain correct shared-memory semantics, however, SMP nodes require mechanisms to guarantee atomic coherence operations. Multiple SMP processors may also result in contention for shared resources and reduce performance. SMP nodes also impact the cost trade-off. While SMPs typically charge higher price premiums, for a given system size SMP nodes substantially reduce networking hardware requirements as compared to uniprocessor nodes. In this paper, we ask the question "Are SMPs cost-effective building blocks for software FGDSM?" We present experimental measurements on Sirocco implementations ranging from an all-software system to a system with minimal hardware support. Together with simple cost models we show that low-cost SMP nodes: (i) result in competitive performance with uniprocessor nodes, (ii) substantially reduce hardware requirements and are more cost-effective than uniprocessor nodes, (iii) significantly benefit from hardware support for coherence operations, and (iv) are especially beneficial for FGDSMs with high-overhead coherence operations.
Typing Copyless Message Passing
We present a calculus that models a form of process interaction based on
copyless message passing, in the style of Singularity OS. The calculus is
equipped with a type system ensuring that well-typed processes are free from
memory faults, memory leaks, and communication errors. The type system is
essentially linear, but we show that linearity alone is inadequate, because it
leaves room for scenarios where well-typed processes leak significant amounts
of memory. We address these problems by basing the type system upon an original
variant of session types.
Comment: 50 pages
Mechanisms for cooperative shared memory
This paper explores the complexity of implementing directory protocols by examining their mechanisms - primitive operations on directories, caches, and network interfaces. We compare the following protocols: Dir1B, Dir4B, Dir4NB, DirnNB, Dir1SW and an improved version of Dir1SW (Dir1SW+). The comparison shows that the mechanisms and mechanism sequencing of Dir1SW and Dir1SW+ are simpler than those for other protocols. We also compare protocol performance by running eight benchmarks on 32-processor systems. Simulations show that Dir1SW+'s performance is comparable to more complex directory protocols. The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources. The small performance difference is attributable to two factors: the low degree of sharing in the benchmarks and Check-In/Check-Out (CICO) directives.